Collaborative hunting in artificial agents with deep reinforcement learning

Collaborative hunting, in which predators play different and complementary roles to capture prey, has been traditionally believed to be an advanced hunting strategy requiring large brains that involve high-level cognition. However, recent findings that collaborative hunting has also been documented in smaller-brained vertebrates have placed this previous belief under strain. Here, using computational multi-agent simulations based on deep reinforcement learning, we demonstrate that decisions underlying collaborative hunts do not necessarily rely on sophisticated cognitive processes. We found that apparently elaborate coordination can be achieved through a relatively simple decision process of mapping between states and actions related to distance-dependent internal representations formed by prior experience. Furthermore, we confirmed that this decision rule of predators is robust against unknown prey controlled by humans. Our computational ecological results emphasize that collaborative hunting can emerge in various intra- and inter-specific interactions in nature, and provide insights into the evolution of sociality.


Introduction
Cooperation among animals often provides fitness benefits to individuals in a competitive natural environment (Smith, 1982; Axelrod and Hamilton, 1981). Cooperative hunting, in which two or more individuals engage in a hunt to successfully capture prey, has been regarded as one of the most widely distributed forms of cooperation in animals (Packer and Ruttan, 1988), and has received considerable attention because of the close links between cooperative behavior, its apparent cognitive demand, and even sociality (Macdonald, 1983; Creel and Creel, 1995; Brosnan et al., 2010; Lang and Farine, 2017). Cooperative hunts have been documented in a wide variety of species (Lang and Farine, 2017; Bailey et al., 2013), yet 'collaboration' (or 'collaborative hunting'), in which predators play different and complementary roles, has been reported in only a handful of vertebrate species (Stander, 1992; Boesch and Boesch, 1989; Gazda et al., 2005). For instance, previous studies have shown that mammals such as lions and chimpanzees are capable of dividing roles among individuals, such as when chasing prey or blocking the prey's escape path, to facilitate capture by the group.

eLife digest

From wolves to ants, many animals are known to be able to hunt as a team. This strategy may yield several advantages: going after bigger prey together, for example, can often result in individuals spending less energy and accessing larger food portions than when hunting alone. However, it remains unclear whether this behavior relies on complex cognitive processes, such as the ability for an animal to represent and anticipate the actions of its teammates. It is often thought that 'collaborative hunting' may require such skills, as this form of group hunting involves animals taking on distinct, tightly coordinated roles, as opposed to simply engaging in the same actions simultaneously.

To better understand whether high-level cognitive skills are required for collaborative hunting, Tsutsui et al. used a type of artificial intelligence known as deep reinforcement learning. This allowed them to develop a computational model in which a small number of 'agents' had the opportunity to 'learn' whether and how to work together to catch a 'prey' under various conditions. To do so, the agents were only equipped with the ability to link distinct stimuli together, such as an event and a reward; this is similar to associative learning, a cognitive process which is widespread amongst animal species.

The model showed that the challenge of capturing the prey when hunting alone, and the reward of sharing food after a successful hunt, drove the agents to learn how to work together, with previous experiences shaping decisions made during subsequent hunts. Importantly, the predators started to exhibit the ability to take on distinct, complementary roles reminiscent of those observed during collaborative hunting, such as one agent chasing the prey while another ambushes it.

Overall, the work by Tsutsui et al. challenges the traditional view that only organisms equipped with high-level cognitive processes can show refined collaborative approaches to hunting, opening the possibility that these behaviors may be more widespread than originally thought, including between animals of different species.

Notably, our predator agents successfully learned to collaborate in capturing their prey solely through a reinforcement learning algorithm, without employing explicit mechanisms comparable to aspects of theory of mind (Yoshida et al., 2008; Foerster, 2019; Hu and Foerster, 2020). Moreover, our results showed that the acquisition of decision rules resulting in collaborative hunting is facilitated by a combination of two factors: the difficulty of capturing prey during solitary hunting, and food (i.e. reward) sharing following capture. We also found that decisions underlying collaborative hunts were related to distance-dependent internal representations formed by prior experience. Furthermore, the decision rules worked robustly against unknown prey controlled by humans. These findings provide insight that collaborative hunts do not necessarily require sophisticated cognitive mechanisms, and that simple decision rules based on mappings between states and actions can be practically useful in nature. Our results support recent suggestions that the underlying processes facilitating collaborative hunting can be relatively simple (Lang and Farine, 2017).

Results
We set out to model the decision process of predators and prey in an interactive environment. In this study, we focused on a chase and escape scenario in a two-dimensional open environment. Chase and escape is a potentially complex phenomenon in which two or more agents interact in environments that change from moment to moment. Nevertheless, many studies have shown that the rules of chase/escape behavior (e.g. which direction to move at each time in a given situation) can be described by relatively simple mathematical models consisting of the current state (e.g. positions and velocities) (Brighton et al., 2017; Tsutsui et al., 2020; Howland, 1974). We therefore considered modeling the agent's decision process in a standard reinforcement learning framework for a finite Markov decision process in which each sequence is a distinct state. In this framework, the agent interacts with the environment through a sequence of states, actions, and rewards, and aims to select actions in a manner that maximizes cumulative future reward (Sutton and Barto, 2018).
We modeled an agent (predator/prey) with independent learning, which is one of the simplest approaches to multi-agent reinforcement learning (Tan, 1993; Figure 1a). In this approach, each agent independently learns its own policy and treats the other agents as part of the environment. In other words, each agent learns policies that are conditioned only on its local observation history, and does not account for the non-stationarity of the multi-agent environment. That is, in contrast to previous studies on multi-agent reinforcement learning (Yoshida et al., 2008; Foerster, 2019; Hu and Foerster, 2020; Tesauro, 2003; Foerster et al., 2016; Silver et al., 2017; Lowe, 2017; Foerster et al., 2018; Sunehag, 2017; Rashid, 2020; Son et al., 2019; Baker, 2019; Christianos et al., 2020; Mugan and MacIver, 2020; Hamrick, 2021; Yu, 2022), our agents did not infer the mental states of others, did not share network parameters and value functions, and did not access models of the environment for planning. For each agent n, the policy πn was represented by a neural network and optimized using the deep Q-network framework (Mnih et al., 2015) (see Methods). The inputs to the neural network are the positions of a specific agent in the absolute coordinate system and the positions and velocities of the specific agent and others in the relative coordinate system with respect to the prey (or the nearest predator), which were determined based on findings in neuroscience (O'Keefe and Dostrovsky, 1971) and ethology (Brighton et al., 2017; Tsutsui et al., 2020), respectively. The outputs are accelerations in 12 directions spaced every 30° in the relative coordinate system, determined with reference to an ecological study (Wilson et al., 2018). We assumed that delays in sensorimotor processing would be compensated for by estimation of the motion of self (Wolpert et al., 1998; Kawato, 1999) and others (Tsutsui et al., 2021), so the current information at each time step was taken as input as is. The play area was constrained to a range of -1 to 1 on the x and y axes, and the initial positions of the predators and prey in each episode were randomly selected from a range of -0.5 to 0.5 on the x and y axes. All agents (predators/prey) were represented as disks with diameters of 0.1. The predator(s) were rewarded for capturing the prey (+1), namely contacting the disks, and punished for moving out of the area (-1), and the prey was punished for being captured by the predator or for moving out of the area (-1). The time step was 0.1 s and the time limit in each episode was set to 30 s. During the evaluation phase, if the predator captured the prey within the time limit, the predator was deemed successful; otherwise, the prey was considered successful. Additionally, if one side (predators/prey) moved out of the area, the other side (prey/predators) was deemed successful.
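For illustration, the sketch below shows one way the discrete action set and an observation vector of this kind could be constructed. The exact feature ordering, whether velocities are expressed relative to the reference agent, and the convention that action 0 points straight at the prey are assumptions for the example, not details taken from the article.

```python
import numpy as np

N_DIRECTIONS = 12                                   # accelerations spaced every 30 degrees
ANGLES = np.deg2rad(np.arange(N_DIRECTIONS) * 30.0)

def action_to_acceleration(action, self_pos, prey_pos, magnitude=1.0):
    """Map a discrete action index to an acceleration vector relative to the prey.

    Action 0 is assumed to point straight at the prey; remaining actions rotate
    that direction in 30-degree steps (a hypothetical convention).
    """
    to_prey = prey_pos - self_pos
    base = np.arctan2(to_prey[1], to_prey[0])        # direction of the prey
    theta = base + ANGLES[action]
    return magnitude * np.array([np.cos(theta), np.sin(theta)])

def make_observation(self_pos, self_vel, others_pos, others_vel, ref_pos):
    """Concatenate the absolute self position with positions/velocities of self and
    others expressed relative to a reference agent (the prey or nearest predator)."""
    rel = [np.concatenate([p - ref_pos, v]) for p, v in zip(others_pos, others_vel)]
    return np.concatenate([self_pos, self_pos - ref_pos, self_vel, *rel])
```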

Exploring the conditions under which collaborative hunting emerges
We first performed computational simulations with three experimental factors to investigate the conditions under which collaborative hunting emerges (Figure 1b; Videos 1-3). As experimental conditions, we selected the number of predators, relative mobility, and prey (reward) sharing based on ecological findings (Bailey et al., 2013; Lang and Farine, 2017). For the number of predators, three conditions were set: 1 (one), 2 (two), and 3 (three). In all these conditions, the number of prey was set to 1. For the relative mobility, three conditions were set: 120% (fast), 100% (equal), and 80% (slow), which represented the acceleration of the predator relative to that of the prey. For the prey sharing, two conditions were set: with sharing (shared), in which all predators were rewarded when any predator caught the prey, and without sharing (individual), in which a predator was rewarded only when it caught the prey by itself. Because reward sharing is not applicable to a solitary predator, there were 15 conditions in total (3 mobility levels for one predator, plus 3 mobility levels × 2 sharing schemes for each of the two- and three-predator cases).
As the example trajectories show, under the fast and equal conditions, the predators often caught their prey shortly after the episode began, whereas under the slow condition, the predators somewhat struggled to catch their prey (Figure 1b). To evaluate their behavior, we calculated the proportion of predations that were successful and the mean episode duration. For the fast and equal conditions, predations were successful in almost all episodes, regardless of the number of predators and the presence or absence of reward sharing (e.g. 0.99 ± 0.00 for the one × fast and one × equal conditions; Figure 2-figure supplement 1). This indicates that in situations where predators were faster than or equal in speed to their prey, they almost always succeeded in capturing the prey, even when hunting alone. Although the mean episode duration decreased with an increasing number of predators in both the fast and equal conditions, the difference was small. As a whole, these results indicate that there is little benefit to cooperation among multiple predators in the fast and equal conditions. Because cooperation among predators is unlikely to emerge under such conditions in nature from an evolutionary perspective (Smith, 1982; Axelrod and Hamilton, 1981), the analysis below is limited to the slow condition. For the slow condition, a solitary predator was rarely successful, and the proportion of predations that were successful increased with the number of predators (Figure 2a). Moreover, the mean duration decreased with an increasing number of predators (Figure 2a bottom). These results indicate that, under the slow condition, the benefits of cooperation among multiple predators are substantial. In addition, except for the two × individual condition, the increase in the proportion of success with an increasing number of predators was much greater than the theoretical prediction (Packer and Ruttan, 1988), which was calculated from the proportion of successful solitary hunts under the assumption that each predator's performance is independent of the others' (see Methods). These results indicate that under these conditions, elaborate hunting behavior (e.g. 'collaboration') that is qualitatively different from hunting alone may emerge.
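For intuition, one natural independence-based benchmark of this kind (a plausible form consistent with the description above; the exact formula used is defined in the Methods) assumes that each of the $n$ predators succeeds independently with the single-predator success probability $p_1$:

$$P_{\text{group}}(n) = 1 - (1 - p_1)^{n}.$$

For example, with $p_1 = 0.1$ this benchmark predicts roughly 0.19 for two predators and 0.27 for three, so observed success rates well above such values point to interactions among the predators rather than independent effort.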
We then examined the agents' behavioral patterns and found that the movement paths predators took to catch their prey differed among the conditions (Figure 2b). As shown in the typical example, under the individual condition, both predators moved in a similar manner toward their prey (Figure 2b left) and, in contrast, under the shared condition, one predator moved toward the prey while the other predator moved along a different route (Figure 2b right). To ascertain their behavioral patterns, we created heat maps showing the frequency of agent presence at each location in the area (Figure 2c). We found a noticeable difference between the individual and shared reward conditions. In the individual condition, the heat maps of the prey and the respective predators were quite similar (Figure 2c), whereas this was not always the case in the shared condition (Figure 2c). In particular, the heat maps of predator 2 in the two-predator condition and predator 3 in the three-predator condition showed localized concentrations (Figure 2c far right, respectively). To assess these differences among predators in more detail, we compared the predators' decisions (i.e. action selections) in these conditions with those in the one-predator condition (i.e. solitary hunts) using two indices, the concordance rate and the circular correlation (Berens, 2009; Figure 2-figure supplement 2). Following previous studies (Scheel and Packer, 1991), we also calculated the ratios of distance moved during hunting among predators (Figure 2-figure supplement 3). Overall, these findings support the idea that predators with heat maps similar to their prey acted as 'chasers' (or 'drivers'), while predators with different heat maps behaved as 'blockers' (or 'ambushers'). That is, our results show that, although most predators acted as chasers, some predators acted as blockers rather than chasers in the shared condition, indicating the emergence of collaborative hunting characterized by role divisions among predators under this condition.
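A minimal sketch of how such occupancy heat maps and their pairwise similarity could be computed is given below; the grid size, the input format (an array of recorded positions per agent), and the use of a Pearson correlation between flattened maps are assumptions for illustration, not the authors' exact analysis code.

```python
import numpy as np
from scipy.stats import pearsonr

def occupancy_heatmap(positions, bins=50, extent=1.0):
    """2-D histogram of where an agent spent its time; positions is an (N, 2) array."""
    hist, _, _ = np.histogram2d(
        positions[:, 0], positions[:, 1],
        bins=bins, range=[[-extent, extent], [-extent, extent]],
    )
    return hist / hist.sum()

def heatmap_similarity(map_a, map_b):
    """Pearson correlation between two flattened occupancy maps."""
    r, p = pearsonr(map_a.ravel(), map_b.ravel())
    return r, p
```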

Mechanistic interpretability of collaboration
We next sought the predators' internal representations to better understand how such collaborative hunting is accomplished. Using two-dimensional t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008), we visualized the last hidden layers of the state and action streams in the policy network as internal representations of the agents (Figure 3, Figure 3-figure supplements 1-3). To understand how each agent represents its environment and what aspects of the state are well represented, we examined the relationship between the scenes of a typical scenario and their corresponding points on the embedding (Figure 3a and b). As expected, when the predator is likely to catch its prey (e.g. scene 4), the predator estimated a higher state value, whereas when it is not (e.g. scene 5), the predator estimated a lower state value (Figure 3a top). Related to this, the variance of action values tends to be larger for both predator and prey when they are close (Figure 3a bottom), indicating that the difference in the value of choosing each action is greater when the choice of action is directly related to the reward (see also Figure 3-figure supplements 8 and 9). These results suggest that the agents were able to learn networks that output estimations of state and action values consistent with our intuition.
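As a hedged illustration of this kind of analysis, the sketch below embeds hidden-layer activations with scikit-learn's t-SNE and colors the result by distance to the prey. The activation-extraction step, file names, and array shapes are hypothetical and only indicate the general procedure.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Suppose `hidden_activations` is an (n_states, 32) array collected by passing
# recorded states through the last hidden layer of one value stream, and
# `distances_to_prey` holds the corresponding predator-prey distances.
hidden_activations = np.load("predator2_state_stream.npy")      # hypothetical file
distances_to_prey = np.load("predator2_distance_to_prey.npy")   # hypothetical file

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    hidden_activations
)

# Coloring the 2-D embedding by distance to the prey makes a distance-dependent
# organization of the representation visible, as described in the text.
plt.scatter(embedding[:, 0], embedding[:, 1], c=distances_to_prey, s=4, cmap="Blues_r")
plt.colorbar(label="distance between predator and prey")
plt.show()
```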
Furthermore, we found a distinct feature in the embedding of the predators' representations. Specifically, in certain state transitions, the position of the points on the embedding changed little, even though the agents were moving (e.g. scenes 1-2 on the embedding of predator 2). From this, we deduced that the predators' representations may be more focused on encoding the distance between themselves and others than on the specific locations of both parties. To test this reasoning, we colored the representations according to the distance between predators and prey; distance 1 denotes the distance between predator 1 and the prey, and distance 2 denotes that between predator 2 and the prey. As a result, the representations of predators in the shared condition could be clearly separated by the distance-dependent coloration (Figure 3c right), in contrast to those in the individual condition (Figure 3c left). These indicate that the predators in the shared condition estimated state and action values and made decisions associated with distance-dependent representations (see Figure 3-figure supplement 2 for the prey's decision).

Evaluating the playing strength of predator agents using joint play with humans
Finally, to verify the generality of the predators' decisions against unknown prey, we conducted an experiment of joint play between agents and humans. In the joint play, human participants controlled the prey on a screen using a joystick. The objective, as in the computational simulation described above, was to evade capture until the end of the episode (30 s) while remaining within the area. We found that the outcomes of the joint play showed similar trends to those of the computer simulation (Figure 4a), in that the proportion of predations that were successful increased and the mean episode duration decreased as the number of predators increased. These results indicate that the predator agents' decision rules worked well against prey controlled by humans. To visualize the associations of states experienced by predator agents versus agents and versus humans, we show colored two-dimensional t-SNE embeddings of the representations in the last hidden layers of the state and action streams (Figure 4b, Figure 4-figure supplement 1). These showed that, in contrast to a previous study (Mnih et al., 2015), the states were quite distinct, suggesting that the predator agents experienced unfamiliar states when playing against prey controlled by humans. This unfamiliarity may make it difficult for predators to make proper decisions. Indeed, in the one-predator condition, the predator agent occasionally exhibited odd behavior (e.g. staying in one place; see Figure 4-figure supplement 2). On the other hand, in the two- and three-predator conditions, predator agents rarely exhibited such behavior and showed superior performance. This indicates that decision rules of cooperative hunting acquired in certain environments could be applied in other, somewhat different, environments.

Discussion
Collaborative hunting has traditionally been thought of as an advanced hunting strategy that involves high-level cognition such as aspects of theory of mind (Boesch and Boesch-Achermann, 2000; Boesch, 2002). Here, we have shown that 'collaboration' (Boesch and Boesch, 1989) can emerge in group hunts of artificial agents based on deep reinforcement learning. Notably, our predator agents successfully learned to collaborate in capturing their prey solely through a reinforcement learning algorithm, without employing explicit mechanisms comparable to aspects of theory of mind (Yoshida et al., 2008; Foerster, 2019; Hu and Foerster, 2020). This means that, in contrast to the traditional view, apparently elaborate coordination can be accomplished by relatively simple decision rules, that is, mappings between states and actions. This result advances our understanding of cooperative hunting behavior and its decision process, and may offer a novel perspective on the evolution of sociality.
Our results on agent behavior are broadly consistent with previous observations of animal behavior in nature. First, as the number of predators increased, success rates increased and hunting duration decreased (Creel and Creel, 1995). Second, whether collaborative hunts emerged depended on two factors: the success rate of hunting alone (Busse, 1978; Boesch, 2002) and the presence or absence of reward sharing following prey capture (Boesch, 1994; Stanford, 1996). Third, while each predator generally maintained a consistent role during repeated collaborative hunts, there was flexibility for these roles to be swapped as needed (Stander, 1992; Boesch, 2002). Finally, predator agents in this study acquired different strategies depending on the conditions despite having exactly the same initial values (i.e. network weights), resonating with the findings that lions and chimpanzees living in different regions exhibit different hunting strategies (Stander, 1992; Boesch-Achermann and Boesch, 1994). These results suggest the validity of our computational simulations and highlight the close link between predators' behavioral strategies and their living environments, such as the presence of other predators and the sharing of prey.
The collaborative hunts showed performance that surpassed the theoretical predictions based on solitary hunting outcomes. This result is in line with the notion that role division among predators in nature could provide fitness benefits (Lang and Farine, 2017; Boesch and Boesch-Achermann, 2000). Meanwhile, when three predators were involved, performance was comparable whether prey was shared or not. One possible factor behind this is spatial constraint. We found that predators occasionally blocked the prey's escape path, exploiting the boundaries of the play area and the chasing movements of other predators even in the individual reward condition (Figure 2-figure supplement 4). These results suggest that, under certain scenarios, coordinated hunting behaviors that enhance the success rate of predators may emerge regardless of whether food is shared, potentially relating to the benefits of social predation, including interspecific hunting (Bshary et al., 2006; Thiebault et al., 2016; Sampaio et al., 2021).
We found that the mappings resulting in collaborative hunting were related to distance-dependent internal representations. Additionally, we showed that distance-dependent rule-based predators successfully reproduced behaviors similar to those of the deep reinforcement learning predators, supporting the association between decisions and distances (Methods). Deep reinforcement learning has held promise as a comprehensive framework for studying the interplay among learning, representation, and decision making (Botvinick et al., 2020; Mobbs et al., 2021), but such efforts for natural behavior have been limited (Banino et al., 2018; Jaderberg et al., 2019). Our result that distance-dependent representations relate to collaborative hunting is reminiscent of a recent idea about decision rules obtained by observation in fish (Steinegger et al., 2018). Notably, the input variables of the predator agents do not include variables corresponding to the distance(s) between the other predator(s) and the prey; this means that the predators in the shared conditions acquired an internal representation relating to distance to the prey, which would be a geometrically reasonable indicator, through optimization via interaction with their environment. Our results suggest that deep reinforcement learning methods can extract systems of rules that allow for the emergence of complex behaviors.
The predator agents' decision rules (i.e. policy networks), acquired through interactions with other agents (i.e. self-play), were also useful against unknown prey controlled by humans, despite the dissociation of the experienced states. This suggests that decision rules formed by associative learning can successfully address natural problems, such as catching prey with somewhat different movement patterns than one's usual prey. Note that the learning mechanism of associative learning (or reinforcement learning) is relatively simple, yet it allows for flexible behavior in response to situations, in contrast to innate and simple stimulus-response mappings. Indeed, our prey agents achieved a higher rate of successful evasions than the prey operated by humans. Our view that decisions for successful hunting are made through representations formed by prior experience is a counterpart to the recent idea that computations relevant to successful escape may be cached and ready to use, instead of being computed from scratch on the spot (Evans et al., 2019). If animals' decision processes in predator-prey dynamics are structured in this way, it could be a product of natural selection, enabling rapid, robust, and flexible action in interactions with severe time constraints.
In conclusion, we demonstrated that the decisions underlying collaborative hunting among artificial agents can be achieved through mappings between states and actions. This means that collaborative hunting can emerge in the absence of explicit mechanisms comparable to aspects of theory of mind, supporting the recent idea that collaborative hunting does not necessarily rely on complex cognitive processes in brains (Lang and Farine, 2017). Our computational ecology is an abstraction of a real predator-prey environment. Given that chase and escape often involve various factors, such as energy cost (Hubel et al., 2016), partial observability (Mugan and MacIver, 2020; Hunt et al., 2021), signal communication (Vail et al., 2013), and local surroundings (Evans et al., 2019), these results are only a first step on the path to understanding real decisions in predator-prey dynamics. Furthermore, exploring how mechanisms comparable to aspects of theory of mind (Yoshida et al., 2008; Foerster, 2019; Hu and Foerster, 2020) or shared value functions (Lowe, 2017; Foerster et al., 2018; Rashid, 2020), which are increasingly common in multi-agent reinforcement learning, play a role in these interactions could be an intriguing direction for future research. We believe that our results provide a useful advance toward understanding natural value-based decisions and forge a critical link between ecology, ethology, psychology, neuroscience, and computer science.

Methods

Environment
The predator and prey interacted in a two-dimensional world with continuous space and discrete time. This environment was constructed by modifying an environment known as 'predator-prey' within a multi-agent particle environment (Lowe, 2017). Specifically, the position of each agent was calculated by integrating the acceleration (i.e. the selected action) twice with the Euler method, and viscous resistance proportional to velocity was considered. The modifications were that the action space (play area size) was constrained to the range of -1 to 1 on the x and y axes, all agent (predator/prey) disk diameters were set to 0.1, landmarks (obstacles) were eliminated, and predator-to-predator contact was ignored for simplicity (Tsutsui et al., 2022). The predator(s) were rewarded for capturing the prey (+1), namely contacting the disks, and punished for moving out of the area (-1), and the prey was penalized for being captured by the predator or for moving out of the area (-1). The predators and prey were represented as red and blue disks, respectively, and the play area was represented as a black square enclosing them. The time step was 0.1 s and the time limit in each episode was set to 30 s. The initial positions of the predators and prey in each episode were randomly selected from a range of -0.5 to 0.5 on the x and y axes.
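A minimal sketch of the dynamics described here is shown below. The time step (0.1 s) and disk diameter (0.1) follow the text; the damping coefficient for the viscous resistance is an assumed placeholder value.

```python
import numpy as np

DT = 0.1          # simulation time step in seconds (from the text)
DAMPING = 0.25    # viscous-resistance coefficient: an assumed value for illustration

def euler_step(pos, vel, accel):
    """One Euler update of an agent's state with velocity-proportional drag.

    The acceleration is the selected action; drag is subtracted in proportion to
    the current velocity, and position is obtained by integrating twice.
    """
    vel = vel + (accel - DAMPING * vel) * DT
    pos = pos + vel * DT
    return pos, vel

def captured(pred_pos, prey_pos, diameter=0.1):
    """Capture occurs when the two disks (each of diameter 0.1) come into contact."""
    return np.linalg.norm(pred_pos - prey_pos) <= diameter
```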

Experimental conditions
We selected the number of predators, relative mobility, and prey (reward) sharing as experimental conditions, based on ecological findings (Bailey et al., 2013; Lang and Farine, 2017). For the number of predators, three conditions were set: 1 (one), 2 (two), and 3 (three). In all these conditions, the number of prey was set to 1. For the relative mobility, three conditions were set: 120% (fast), 100% (equal), and 80% (slow) for the acceleration exerted by the predator, relative to that exerted by the prey. For the prey sharing, two conditions were set: with sharing (shared), in which all predators were rewarded when a predator caught the prey, and without sharing (individual), in which a predator was rewarded only when it caught the prey by itself. In total, there were 15 conditions.
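For clarity, the full condition grid can be enumerated as follows (a trivial sketch; the condition labels are taken from the text, and reward sharing is treated as inapplicable for a single predator):

```python
from itertools import product

mobility = ["fast", "equal", "slow"]             # 120%, 100%, 80% predator acceleration
sharing = ["shared", "individual"]

conditions = [("one", m) for m in mobility]       # reward sharing is moot for one predator
conditions += [(n, m, s) for n, m, s in product(["two", "three"], mobility, sharing)]

print(len(conditions))   # -> 15
```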

Agent architecture
We considered a sequential decision-making setting in which a single agent interacts with an environment E in a sequence of observations, actions, and rewards. At each time step $t$, the agent observes a state $s_t \in S$ and selects an action $a_t$ from a discrete set of actions $A = \{1, 2, \ldots, |A|\}$. One time step later, in part as a consequence of its action, the agent receives a reward $r_{t+1} \in R$ and moves to a new state $s_{t+1}$. In the MDP, the agent thereby gives rise to a sequence that begins as $s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \ldots$, and learns a behavioral rule (policy) that depends upon these sequences.
The goal of the agent is to maximize the expected discounted return over time through its choice of actions (Sutton and Barto, 2018). The discounted return was defined as $R_t = \sum_{k=0}^{T} \gamma^k r_{t+k+1}$, where $\gamma \in [0, 1]$ is a parameter called the discount rate that determines the present value of future rewards, and $T$ is the time step at which the task terminates. The state-value function, action-value function, and advantage function are defined as $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s, \pi]$, $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, and $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$, respectively, where $\pi$ is a policy mapping states to actions. The optimal action-value function $Q^{\star}(s, a)$ is then defined as the maximum expected discounted return achievable by following any strategy, after observing some state $s$ and then taking some action $a$: $Q^{\star}(s, a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$. The optimal action-value function can be computed by finding a fixed point of the Bellman equation

$$Q^{\star}(s, a) = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q^{\star}(s', a') \,\middle|\, s, a \right],$$

where $s'$ and $a'$ are the state and action at the next time step, respectively. This is based on the following intuition: if the optimal value $Q^{\star}(s', a')$ of the state $s'$ were known for all possible actions $a'$, the optimal strategy would be to select the action $a'$ maximizing the expected value of $r + \gamma \max_{a'} Q^{\star}(s', a')$. The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, $Q_{i+1}(s, a) = \mathbb{E}\!\left[ r + \gamma \max_{a'} Q_i(s', a') \,\middle|\, s, a \right]$. Such value iteration algorithms converge to the optimal action-value function in situations where all states can be sufficiently sampled. In practice, however, it is often difficult to apply this basic approach, which estimates the action-value function separately for each state, to real-world problems. Instead, it is common to use a function approximator to estimate the action-value function, $Q(s, a; \theta) \approx Q^{\star}(s, a)$. There are several possible methods for function approximation, yet we here use a neural network function approximator referred to as a deep Q-network (DQN) (Mnih et al., 2015) and some of its extensions that overcome limitations of the DQN, namely Double DQN (Van Hasselt et al., 2016), Prioritized Experience Replay (Schaul et al., 2015), and Dueling Networks (Wang, 2016). Naively, a Q-network with weights $\theta$ can be trained by minimizing a loss function that changes at each iteration $i$,

$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\!\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right],$$

where $y_i = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right]$ is the target value for iteration $i$, and $\rho(s, a)$ is a probability distribution over states $s$ and actions $a$. The parameters from the previous iteration $\theta_{i-1}$ are kept constant when optimizing the loss function $L_i(\theta_i)$. By differentiating the loss function with respect to the weights, we arrive at the following gradient:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s'}\!\left[ \left( y_i - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right].$$

We could attempt to use the simplest Q-learning to learn the weights of the network $Q(s, a; \theta)$ online; however, this estimator performs poorly in practice. In this simplest form, incoming data are discarded immediately, after a single update. This results in two issues: (i) strongly correlated updates that break the i.i.d. assumption of many popular stochastic gradient-based algorithms, and (ii) the rapid forgetting of possibly rare experiences that would be useful later. To address both of these issues, a technique called experience replay is often adopted (Lin, 1992), in which the agent's experiences at each time step, $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$, are stored in a dataset (also referred to as replay memory) $D = \{e_1, e_2, \ldots, e_N\}$, where $N$ is the dataset size, for some time period. When training the Q-network, instead of only using the current experience as prescribed by standard Q-learning, mini-batches of experiences are sampled from $D$ uniformly at random to train the network. This breaks the temporal correlations by mixing more and less recent experiences for the updates, and rare experiences will be used for more than just a single update. Another technique, called the target network, is also often used to stabilize learning. To achieve this, the target value $y_i$ is replaced by $y_i = r + \gamma \max_{a'} Q(s', a'; \theta_i^{-})$, where $\theta_i^{-}$ are weights that are frozen for a fixed number of iterations. The full algorithm combining these ingredients, namely experience replay and the target network, is often called a deep Q-network (DQN), and its loss function takes the form

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)}\!\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right],$$

where $y_i = r + \gamma \max_{a'} Q(s', a'; \theta_i^{-})$ and $U(\cdot)$ denotes uniform sampling.
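A compact sketch of the replay-buffer and target-network machinery described above is given below. The buffer capacity and mini-batch size match the values reported in the Training details; the discount factor and tensor handling are simplified placeholders, and this is not the authors' code.

```python
import random
from collections import deque

import numpy as np
import torch


class ReplayBuffer:
    """Uniform experience replay: store transitions and sample i.i.d. mini-batches."""

    def __init__(self, capacity=10_000):            # replay memory size from the text
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):                 # mini-batch size from the text
        batch = random.sample(self.buffer, batch_size)
        arrays = [np.asarray(x, dtype=np.float32) for x in zip(*batch)]
        s, a, r, s2, d = (torch.from_numpy(x) for x in arrays)
        return s, a.long(), r, s2, d


def dqn_target(reward, next_state, done, target_net, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-), using the frozen target network.

    gamma = 0.99 is an assumed placeholder; the article does not state its value here.
    """
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values
    return reward + gamma * (1.0 - done) * next_q
```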
It is known that Q-learning algorithms perform poorly in some stochastic environments. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation of the maximum expected action value. As a method to alleviate the performance degradation due to this overestimation, Double Q-learning, which decomposes the maximum operation into action selection and action evaluation by introducing a double estimator, was proposed (Hasselt, 2010). Double DQN (DDQN) is an algorithm that applies the Double Q-learning method to DQN (Van Hasselt et al., 2016). For the DDQN, in contrast to the original Double Q-learning and another proposed method (Fujimoto et al., 2018), the target network in the DQN architecture, although not fully decoupled, is used as the second value function, and the target value in the DQN loss function above is replaced for iteration $i$ as follows:

$$y_i = r + \gamma\, Q\!\left(s', \arg\max_{a'} Q(s', a'; \theta_i);\, \theta_i^{-}\right).$$

Prioritized Experience Replay is a method that aims to make learning more efficient and effective than if all transitions were replayed uniformly (Schaul et al., 2015). For prioritized replay, the probability of sampling transition $i$ from the dataset is defined as

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}},$$

where $p_i > 0$ is the priority of the transition for iteration $i$ and the exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ corresponding to uniform sampling. The priority is determined by $p_i = |\delta_i| + \epsilon$, where $\delta_i$ is a temporal-difference (TD) error (e.g. $\delta_i = y_i - Q(s_i, a_i; \theta_i)$) and $\epsilon$ is a small positive constant that prevents transitions from never being revisited once their error is zero. Prioritized replay introduces sampling bias, and therefore changes the solution to which the estimates will converge. This bias can be corrected by importance-sampling (IS) weights $w_i = \left(\frac{1}{N}\cdot\frac{1}{P(i)}\right)^{\beta}$ that fully compensate for the non-uniform probabilities $P(i)$ when $\beta = 1$. Dueling Network is a neural network architecture designed for value-based algorithms such as DQN (Wang, 2016). It features two streams of computation, the value and advantage streams, sharing a common encoder and merged by an aggregation module that produces an estimate of the state-action value function. Intuitively, we can expect the dueling network to learn which states are (or are not) valuable, without having to learn the effect of each action for each state. For reasons of stability of the optimization, the last module of the network is implemented as follows:

$$Q(s, a; \theta, \eta, \xi) = V(s; \theta, \eta) + \left( A(s, a; \theta, \xi) - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta, \xi) \right),$$

where $\theta$ denotes the parameters of the common layers, whereas $\eta$ and $\xi$ are the parameters of the layers of the two streams, respectively. We here modeled an agent (predator/prey) with independent learning, one of the simplest approaches to multi-agent reinforcement learning (Tan, 1993). In this approach, each agent independently learns its own policy and treats the other agents as part of the environment. In other words, each agent learns policies that are conditioned only on its local observation history, and does not account for the non-stationarity of the multi-agent environment. That is, in contrast to previous studies on multi-agent reinforcement learning (Tesauro, 2003; Foerster et al., 2016; Silver et al., 2017; Lowe, 2017; Foerster et al., 2018; Sunehag, 2017; Rashid, 2020; Son et al., 2019; Baker, 2019; Christianos et al., 2020; Mugan and MacIver, 2020; Hamrick, 2021; Yu, 2022), our agents did not share network parameters and value functions, and did not access models of the environment for planning. For each agent $n$, the policy $\pi_n$ is represented by a neural network and optimized within the DQN framework including DDQN, Prioritized Experience Replay, and the Dueling architecture. The loss function of each agent takes the form

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim P(D)}\!\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right],$$

where $y_i = r + \gamma\, Q\!\left(s', \arg\max_{a'} Q(s', a'; \theta_i);\, \theta_i^{-}\right)$ and $P(\cdot)$ denotes prioritized sampling. For simplicity, we omitted the agent index $n$ in these equations.
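A small numerical sketch of the prioritized-sampling quantities defined above is shown below; the toy TD errors and the values of α, β, and ε are arbitrary illustration values, not the hyperparameters used in the study.

```python
import numpy as np

td_errors = np.array([0.5, 0.1, 2.0, 0.0])      # toy TD errors for four stored transitions
eps, alpha, beta = 1e-3, 0.6, 0.4                # illustrative hyperparameter values

priorities = np.abs(td_errors) + eps             # p_i = |delta_i| + eps
probs = priorities**alpha / np.sum(priorities**alpha)   # P(i)

n = len(td_errors)
is_weights = (1.0 / (n * probs))**beta           # importance-sampling corrections
is_weights /= is_weights.max()                   # common normalization for stability

sampled = np.random.choice(n, size=2, p=probs)   # draw a prioritized mini-batch
```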

Training details
The neural network was composed of four layers (Figure 1-figure supplement 1). There was a separate output unit for each possible action, and only the state representation was an input to the neural network. The inputs to the neural network were the positions of a specific agent in the absolute coordinate system (x- and y-positions) and the positions and velocities of the specific agent and others in the relative coordinate system (u- and v-positions and u- and v-velocities) (Figure 1-figure supplement 2), which were determined based on findings in neuroscience (O'Keefe and Dostrovsky, 1971) and ethology (Brighton et al., 2017; Tsutsui et al., 2020), respectively. We assumed that delays in sensory processing were compensated for by estimation of the motion of self (Wolpert et al., 1998; Kawato, 1999) and others (Tsutsui et al., 2021), and the current information at each time step was used as input as is. The outputs were the accelerations in 12 directions spaced every 30° in the relative coordinate system, which were determined with reference to an ecological study (Wilson et al., 2018). After the first two hidden layers of the MLP with 64 units, the network branched off into two streams. Each branch had one MLP layer with 32 hidden units. ReLU was used as the activation function for each layer (Glorot et al., 2011). The network parameters θ_n, η_n, and ξ_n were iteratively optimized via stochastic gradient descent with the Adam optimizer (Kingma and Ba, 2014). In the computation of the loss, we used the Huber loss to prevent extreme gradient updates (Huber, 1992). The model was trained for 10^6 episodes, and the network parameters were copied to the target network every 2000 episodes. The replay memory size was 10^4, the mini-batch size during training was 32, and the learning rate was 10^-6.
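A sketch of a dueling Q-network matching this description is given below. The layer widths, learning rate, and loss follow the text; the input dimensionality is a hypothetical placeholder, and the number of output units is set to the 12 movement directions described here (the 13-way output mentioned later for behavioral cloning, presumably including a no-acceleration action, would simply change this number).

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Sketch of the described architecture: two shared 64-unit layers, then
    separate 32-unit value and advantage streams merged into Q-values."""

    def __init__(self, state_dim, n_actions=12):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.value_stream = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
        self.adv_stream = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, n_actions))

    def forward(self, state):
        h = self.shared(state)
        value = self.value_stream(h)                         # V(s)
        adv = self.adv_stream(h)                             # A(s, a)
        return value + adv - adv.mean(dim=1, keepdim=True)   # dueling aggregation

# Optimizer and loss choices mentioned in the text (Adam, Huber loss);
# state_dim = 16 is a hypothetical input size for the example.
net = DuelingQNetwork(state_dim=16)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-6)
loss_fn = nn.SmoothL1Loss()   # Huber loss
```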
Rule-based agents

As with the deep reinforcement learning agents, the information available for the rule-based predators' decisions was limited to the current information (e.g. position and velocity), and the output was provided in a coordinate system relative to the prey; that is, action 1 denotes movement toward the prey and action 7 denotes movement in the opposite direction to the prey. The predator agent first determines whether it, or another predator, is closer to the prey, and then, if the other predator is closer, it determines whether distance 2 is less than a certain distance threshold (set to 0.4 in our simulation). The decision rule for each predator is selected by this branching, with predator 1 adopting the three rules 'chase,' 'shortcut,' and 'approach,' and predator 2 adopting the two rules 'chase' and 'ambush.' For the chase, the predator first determines whether it is near the outer edge of the play area and, if so, selects actions that will prevent it from leaving the play area. Specifically, if the predator's position is such that |x| > 0.9 and |y| > 0.9, action 3 for clockwise (CW) or action 11 for counterclockwise (CCW) was selected, respectively, and if 0.8 < |x| ≦ 0.9 and 0.8 < |y| ≦ 0.9, action 2 for CW or action 12 for CCW was selected. The CW and CCW directions were determined by the absolute position of the prey and the relative position vector between the closer predator and the prey; the play area was divided into four parts based on the signs of the x and y coordinates, and CW and CCW were determined by the correspondence between each area and the sign of the larger component in absolute value (x or y) of the relative position vector. For instance, if the closer predator is at (0.2, 0.3) and the prey is at (0.5, 0.2), the rotation is determined to be CW. If the predator is not near the edge of the play area, it then determines whether the prey is near the center of the play area and, if so, selects actions that will drive the prey outward; if the prey's position is such that |x| ≦ 0.5 and |y| ≦ 0.5, action 11 for CW or action 3 for CCW was selected, and if 0.5 < |x| ≦ 0.6 and 0.5 < |y| ≦ 0.6, action 12 for CW or action 2 for CCW was selected. In other situations, the predator selects actions so that its direction of movement is aligned with that of the prey; if the angle ψ between the velocity vectors of the predator and prey satisfies ψ ≦ -50°, action 3 was selected; if -50° < ψ ≦ -15°, action 2; if -15° < ψ ≦ 15°, action 1; if 15° < ψ ≦ 50°, action 2; and if 50° < ψ, action 3. For the shortcut, the predator determines whether it is near the outer edge of the play area and, if so, selects the actions described above; otherwise, it selects actions producing shorter paths to the prey (action 2 for CW or action 12 for CCW). For the approach, the predator determines whether it is near the outer edge of the play area and, if so, selects the actions described above; otherwise, it selects the action that moves it toward the prey (action 1).
For the ambush, the predator selects actions that move it toward the top center or bottom center of the play area and keep it at that location until the situation changes. If the predator's position is such that y ≦ 0, the predator moved with respect to the bottom center point (-0.1, 0.5), and if y > 0, it moved toward the top center point (0, 0.6). The coordinates of the top center and bottom center points were based on the results of the deep reinforcement learning agents. Specifically, we first divided the play area into four parts based on the signs of the x and y coordinates with respect to the reference (i.e. bottom center or top center) point, and in each area the predator selected action 3, 8, or 12 (every 120 degrees) so as to move toward the reference point, depending on the direction of the prey from the predator's perspective. For instance, if the predator is at (-0.2, 0.8) and the prey is at (-0.2, -0.8), action 12 is selected.
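A schematic sketch of the top-level branching described above is shown below. The exact assignment of 'shortcut', 'approach', and 'ambush' to particular branch outcomes is an assumption for illustration only; the per-rule action tables themselves are given in the text and are not reproduced here.

```python
import math

def select_rule(self_pos, other_pos, prey_pos, role, threshold=0.4):
    """Top-level branching of the rule-based predators (a schematic reading of the
    text; the mapping from branch outcomes to specific rules is a guess for
    illustration, not the authors' specification)."""
    d_self = math.dist(self_pos, prey_pos)
    d_other = math.dist(other_pos, prey_pos)

    if d_self <= d_other:
        return "chase"
    # The other predator is closer; behavior depends on how close it already is.
    if d_other < threshold:
        return "shortcut" if role == 1 else "ambush"
    return "approach" if role == 1 else "chase"

print(select_rule((0.8, 0.0), (0.1, 0.1), (0.0, 0.0), role=2))  # -> 'ambush'
```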

Behavioral cloning
We constructed neural networks to clone the predatory behavior of the rule-based agents. Each neural network is composed of two weight layers; that is, it takes the state of the environment as input, as in the deep reinforcement learning agents, processes it through a hidden layer, and then outputs probabilities for each of the 13 potential actions using the softmax function. To ensure a fair comparison with the embedding of the deep reinforcement learning agents, we set the number of units in the hidden layer to 32. In these networks, all layers were fully connected. For each agent (i.e. predator 1 and predator 2), we implemented two types of networks: a linear network without any nonlinear transformation, and a nonlinear network with ReLU activations. Specifically, in the linear network, the hidden layer is a fully connected layer without nonlinearity, $h = W_{xh} x + b_h$, where $x$, $h$, $W_{xh}$, and $b_h$ denote the input to the hidden layer (state), the output of the hidden layer, the input-to-hidden weights, and the bias, respectively. In the nonlinear network, the hidden layer is a fully connected layer with nonlinearity, $h = \mathrm{ReLU}(W_{xh} x + b_h)$. The following dataset was generated:

Video 3. Example videos in the three-predator conditions. https://elifesciences.org/articles/85694/figures#video3
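A minimal sketch of such a cloning network and its training objective (cross-entropy on state-action pairs collected from the rule-based agents) is shown below; the class and function names, optimizer, and training-loop details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CloningNet(nn.Module):
    """Two weight layers: state -> 32-unit hidden layer -> 13 action probabilities."""

    def __init__(self, state_dim, n_actions=13, nonlinear=True):
        super().__init__()
        self.hidden = nn.Linear(state_dim, 32)
        self.act = nn.ReLU() if nonlinear else nn.Identity()
        self.out = nn.Linear(32, n_actions)

    def forward(self, state):
        logits = self.out(self.act(self.hidden(state)))
        return torch.softmax(logits, dim=-1)

def train_step(model, optimizer, states, actions):
    """One supervised update: states and the rule-based agents' chosen actions as labels."""
    probs = model(states)
    loss = nn.functional.nll_loss(torch.log(probs + 1e-8), actions)  # cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```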

Figure 1.
Figure 1. Agent architecture and examples of movement trajectories. (a) An agent's policy is represented by a deep neural network (see Methods). A state of the environment is given as input to the network. An action is sampled from the network's output, and the agent receives a reward and a subsequent state. The agent learns to select actions that maximize cumulative future rewards. In this study, each agent learned its policy network independently, that is, each agent treated the other agents as part of the environment. This illustration shows a case with three predators. (b) The movement trajectories are examples of interactions between predator(s) (dark blue, blue, and light blue) and prey (red), overlaying 10 episodes in each experimental condition. The experimental conditions were the number of predators (one, two, or three), relative mobility (fast, equal, or slow), and reward sharing (individual or shared), based on ecological findings. The online version of this article includes the following figure supplement(s) for figure 1: Figure supplement 1. Network architecture.

Figure supplement 2.
Figure supplement 2. Diagram of model input.

Figure 2.
Figure 2. Emergence of collaborations among predators. (a) Proportion of predations that were successful (top) and mean episode duration (bottom). For both panels, quantitative data denote the mean of 100 episodes ± SEM across 10 random seeds. The error bars are barely visible because the variation is negligible. The theoretical prediction values were calculated based on the proportion of solitary hunts (see Methods). The proportion of predations that were successful increased as the number of predators increased (F_number(2,18) = 1346.67, p < 0.001; η² = 0.87; one vs. two: t(9) = 20.38, p < 0.001; two vs. three: t(9) = 38.27, p < 0.001). The mean duration decreased with increasing number of predators (F_number(2,18) = 1564.01, p < 0.001; η² = 0.94; one vs. two: t(9) = 15.98, p < 0.001; two vs. three: t(9) = 40.65, p < 0.001). (b) Typical example of different predator routes between the individual (left) and shared (right) conditions, in the two-predator condition. The numbers (1-3) show a series of state transitions (every second) starting from the same initial position. Each panel shows the agent positions and the trajectories leading up to that state. In these instances, the predators ultimately failed to capture the prey within the time limit (30 s) under the individual condition, whereas the predators successfully captured the prey in only 3 s under the shared condition. (c) Comparison of heat maps between the individual (left) and shared (right) reward conditions. The heat maps of each agent were constructed based on the frequency of stay in each position, cumulative over 1000 episodes (100 episodes × 10 random seeds). In the individual condition, there were relatively high correlations between the heat maps of the prey and each predator, regardless of the number of predators (One: r = 0.95, p < 0.001; Two: r = 0.83, p < 0.001 in predator 1, r = 0.78, p < 0.001 in predator 2; Three: r = 0.41, p < 0.001 in predator 1, r = 0.56, p < 0.001 in predator 2, r = 0.45, p < 0.001 in predator 3). In contrast, in the shared condition, only one predator had a relatively high correlation, whereas the others had low correlations (Two: r = 0.65, p < 0.001 in predator 1, r = 0.01, p = 0.80 in predator 2; Three: r = 0.17, p < 0.001 in predator 1, r = 0.54, p < 0.001 in predator 2, r = 0.03, p = 0.23 in predator 3).
The online version of this article includes the following figure supplement(s) for figure 2:

Figure supplement 1.
Figure supplement 1. Proportion of predations that were successful, mean episode duration, and heat maps for each condition.

Figure supplement 3.
Figure supplement 3. Scaled distance among predators and proportion of prey capture.

Figure supplement 4.
Figure supplement 4. Typical example of coordinated hunting behavior in the three × individual condition.

Figure 3.

Figure supplement 1.
Figure supplement 1. Two-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of the representations in the last hidden layers of the state-value stream (top) and action-value stream (bottom) in the individual reward condition, in the slow × two conditions.

Figure supplement 2.
Figure supplement 2. Two-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding colored according to the absolute coordinates of itself in the individual (left) and shared (right) reward conditions, in the slow × two conditions.

Figure supplement 3.
Figure supplement 3. Two-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of the representations in the last hidden layers of the state-value stream and action-value stream, in the slow × three conditions.

Figure supplement 4.
Figure supplement 4. Corresponding state-action values (Q-values) for each state.

Figure supplement 6.
Figure supplement 6. Movement trajectories (left) and heat maps (right) of the rule-based predator agents.

Figure supplement 7.
Figure supplement 7. Two-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of the representations in the last hidden layers of the linear network (top) and the nonlinear network (bottom) in behavioral cloning.

Figure supplement 8.
Figure supplement 8. Histogram of the state value (V-value) in the individual (left) and shared (right) conditions.

Figure supplement 9.
Figure supplement 9. Histogram of the standard deviation of state-action values (Q-values) in individual (left) and shared (right) conditions.

Figure 3.
The numbers (1-5) in each embedding correspond to a selected series of state transitions. The series of agent positions in the state transitions (every second) and, for ease of visibility, the trajectories leading up to that state are shown. (c) Embedding colored according to the distances between predators and prey in the individual (left) and shared (right) reward conditions. Distances 1 and 2 denote the distances between predator 1 and prey and predator 2 and prey, respectively. If both distances are short, the point is colored blue; if both are long, it is colored white.

Figure supplement 10.
Figure supplement 10. Histogram of the distance between the prey and each predator in the individual (left) and shared (right) conditions.

Figure supplement 11.
Figure supplement 11. Histogram of the distance between the prey and each predator in the simulations using rule-based predator agents.

Figure supplement 2.
Figure supplement 2. Comparison of heat maps between individual (left) and shared (right) reward conditions in joint play.